# Climate Change and Environmental Issues Dataset from Ukrainian Telegram Channels

## Overview

This repository contains two datasets that were collected and processed as part of a study on public perception of environmental issues and climate change in Ukraine. The datasets are derived from Ukrainian Telegram news channels and include metadata, raw text, and user reactions to posts related to climate events and environmental topics. These datasets are intended to support academic research on the relationship between public discourse, user sentiment, and climate indicators.

The datasets are located in the `data` folder, organized by file extension: `csv` and `parquet`. When reading `climate_text_data_final` in CSV format, set the encoding to `utf-16`.
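As a minimal sketch of the encoding note above, the CSV file can be read with Python's standard `csv` module. The helper name `read_text_dataset`, the comma delimiter, and the example path are assumptions made for illustration; adjust them to the actual layout of the `data` folder.

```python
import csv


def read_text_dataset(path):
    """Read a UTF-16 encoded CSV file into a list of row dictionaries."""
    # encoding="utf-16" follows the note above; newline="" is the setting
    # recommended by the csv module documentation.
    with open(path, encoding="utf-16", newline="") as handle:
        # Assumption: the file is comma-delimited with a header row.
        return list(csv.DictReader(handle))


# Hypothetical usage (the exact path is an assumption):
# rows = read_text_dataset("data/csv/climate_text_data_final.csv")
# print(len(rows), list(rows[0].keys()))
```

With pandas, the equivalent is passing `encoding="utf-16"` to `pandas.read_csv`.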


---

## Datasets

### 1. **`climate_text_data_final`**
This dataset contains raw text data from Telegram posts, along with additional metadata. It provides a comprehensive view of the content and context of climate-related discussions. It can be joined with `final_reactions_data` on the `channel_name` and `message_id` columns.

When reading the CSV version of this dataset, set the encoding to `utf-16`.
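The join on `channel_name` and `message_id` could look roughly like the sketch below. It assumes both datasets have already been loaded as lists of row dictionaries and that both expose those two columns under exactly these names. The function name `join_on_post_key` is hypothetical.

```python
def join_on_post_key(text_rows, reaction_rows):
    """Inner-join two lists of row dicts on (channel_name, message_id)."""
    # Index the reactions by the composite key; message_id is cast to str
    # so that "42" read from CSV and 42 read from Parquet compare equal.
    reactions_by_key = {
        (row["channel_name"], str(row["message_id"])): row
        for row in reaction_rows
    }
    joined = []
    for row in text_rows:
        key = (row["channel_name"], str(row["message_id"]))
        match = reactions_by_key.get(key)
        if match is not None:
            # On a column-name clash, values from the text dataset win.
            joined.append({**match, **row})
    return joined
```

Posts that have no matching reactions row are dropped here. A left join that keeps them would be an equally reasonable choice, depending on the analysis.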

#### Key Features:
- **Post ID**: Unique identifier for each Telegram post.
- **Channel Name**: The name of the Telegram channel where the post was published.
- **Text**: The raw text of the Telegram post.
- **Metadata**: Includes timestamp, number of views, and number of forwards.

#### Purpose:
This dataset is designed to support natural language processing (NLP) tasks, such as topic modeling, named entity recognition, and sentiment analysis. It provides a foundation for understanding the themes and narratives surrounding climate change and environmental issues in the Ukrainian online information space.

---


### 2. **`final_reactions_data`**
This dataset contains user reactions to Telegram posts, represented as emoji counts. It provides a detailed view of how users engage with climate-related content.

#### Key Features:
- **Post ID**: Unique identifier for each Telegram post.
- **Channel Name**: The name of the Telegram channel where the post was published.
- **Emoji Reactions**: Columns representing counts of various emojis used to react to the post.
- **Is NA**: A boolean value indicating whether all emoji reaction columns are `NaN` or at least one holds a non-NA value.

#### Purpose:
This dataset enables researchers to analyze user sentiment and engagement with climate-related content. It can be used to identify patterns in public reactions to environmental issues and assess the emotional tone of the discourse. The emojis can be grouped into categories to reduce dimensionality and work with a combined representation. Statistics for each emoji category can then be generated, giving a clearer picture of user engagement patterns.
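One way to group emojis into categories is sketched below. The mapping `EMOJI_CLASSES`, the class labels, and the assumption that each emoji column is named by the emoji character itself are all illustrative and not taken from the study. Replace them with a classification suited to your analysis.

```python
import math

# Hypothetical emoji-to-class mapping (illustrative only).
EMOJI_CLASSES = {
    "👍": "positive",
    "🎉": "positive",
    "😢": "negative",
    "😡": "negative",
    "🤔": "neutral",
}


def count_or_zero(value):
    """Convert a reaction cell to a float, treating missing values as 0."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return 0.0  # None, empty string, or other non-numeric cell
    return 0.0 if math.isnan(number) else number


def aggregate_by_class(row, emoji_classes=EMOJI_CLASSES):
    """Sum the per-emoji counts of one row into per-class totals."""
    totals = {label: 0.0 for label in set(emoji_classes.values())}
    for emoji, label in emoji_classes.items():
        totals[label] += count_or_zero(row.get(emoji))
    return totals
```

Applied to every row, this yields a small, fixed set of columns instead of one column per emoji, which is easier to summarize per channel or per time period.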

---


## Research Context

The datasets were collected as part of a study aimed at understanding public attitudes toward environmental issues and exploring the relationship between public perception and climate indicators, especially during the full-scale Russian aggression against Ukraine. The study focused on Telegram channels due to their popularity and influence in Ukraine. The research objectives included:

1. Developing a methodology for automated data collection from Ukrainian Telegram channels on climate-related topics.
2. Conducting a comprehensive analysis of the collected data using natural language processing and statistical methods to identify key topics, trends, and patterns.
3. Investigating the relationship between message characteristics and user reactions to determine factors influencing public perception of environmental issues.

The study analyzed content from seven influential Telegram news channels: **DW Ukraine, BBC Ukrainian, Ukrayinska Pravda, Voice of America, Radio Liberty, Babel, and ZN.UA**. These channels were selected based on their audience size, credibility, and regularity of coverage of environmental issues. The data collection period spanned about five years (01.01.2020 to 14.01.2025), allowing for an analysis of trends over time, including the impact of the Russian war in Ukraine on public discourse.

---

## Ethical Considerations

The datasets do not contain any personally identifiable information (PII). However, we acknowledge that the dataset may contain sensitive content due to the nature of the data. Some records may describe war-related activities, destruction, harm, or other sensitive topics. We have made every effort to remain unbiased in collecting data from the selected channels and have not censored any content. 

The dataset will undergo ethical clearance at Lviv Polytechnic National University to ensure compliance with ethical standards and guidelines for data collection, processing, and usage. This process aims to address potential concerns related to sensitive content and ensure the responsible use of the dataset in academic research.

### Recommendations for Ethical Use:
- **Fairness and Bias**: Evaluate results with fairness metrics to ensure that analyses are not biased or discriminatory.
- **Transparency**: Use tools for interpretability and explainability to ensure transparency in machine learning models and analyses.
- **Monitoring**: Implement machine learning monitoring to improve observability and awareness of system performance.
- **Ethical Awareness**: Be mindful of the potential for sensitive, distorted, or unfair content, particularly when analyzing topics related to war or conflict.
<!-- 
Ethical considerations are critical when working with datasets that may contain sensitive or harmful content. Researchers should take precautions to ensure that their analyses are fair, transparent, and respectful of the context in which the data was collected. -->

---

## Data Collection Methodology

To identify relevant messages, we used an approach based on the **Aho-Corasick algorithm**, which searches for many patterns in a single pass over the text, in time linear in the text length plus the number of matches. This was critical for processing large volumes of information. A thematic dictionary was developed, containing key terms structured into five categories:
1. Climate terms
2. Environmental issues
3. Natural resources
4. Climate events
5. Environmental initiatives

The algorithm was implemented in Python using the `telethon` library for collecting messages and the `pyahocorasick` library for building a finite-state automaton that searches for all patterns simultaneously. As a result, **5,732 relevant messages** related to climate change and environmental issues were identified and selected.
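To illustrate the matching step, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is a stand-in written for this README, not the study's code, which used `pyahocorasick`. The terms in `TERM_CATEGORIES` are a hypothetical excerpt rather than the actual thematic dictionary, and lowercasing the text before matching is an assumption.

```python
from collections import deque

# Hypothetical mini-dictionary: lowercase term -> category (illustrative only).
TERM_CATEGORIES = {
    "зміна клімату": "Climate terms",       # "climate change"
    "клімат": "Climate terms",              # "climate"
    "забруднення": "Environmental issues",  # "pollution"
    "повінь": "Climate events",             # "flood"
}


def build_automaton(terms):
    """Build trie transitions, failure links, and output sets for the terms."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    out = [set()]  # terms recognized on reaching each state
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(term)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit terms that end at the longest proper suffix state.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_terms(text, automaton):
    """Return the set of dictionary terms that occur in the text."""
    goto, fail, out = automaton
    state = 0
    found = set()
    for ch in text.lower():
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found
```

A message could then be treated as relevant when `find_terms` returns a non-empty set, and the matched terms can be mapped to their categories through `TERM_CATEGORIES`. The automaton is built once and reused for every message, which is where the single-pass behavior pays off.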

---

## Citation

If you use these datasets in your research, please cite the following [publication](https://ena.lpnu.ua/items/f20c3232-e67a-4f99-8502-80b05e9474f8):

> Ustianovych T. Climate event dataset based on Ukrainian online information space / Ustianovych Taras, Fedushko Solomiia // Information, communication, society 2025: ICS-2025 : Proceedings of the XIV International scientific conference, 22-24 May, 2025. Lviv : Lviv Politechnic Publishing House, 2025. — P. 73–74. — (Systems of artificial intelligence and machine learning).

---
<!-- 
## License

The datasets are released under the [Insert License Here]. Please review the license terms before using the data. -->

<!-- --- -->

## Contact

For questions or further information about the datasets, please contact:

- **Taras Ustyianovych**
- **Lviv Polytechnic National University**
- **taras.o.ustyianovych@lpnu.ua**